Initial Exploration of the P/NBD Model
In this workbook we introduce the various different BTYD models, starting with a discussion of the underlying theory.
1 Background Theory
Before we start working on fitting and using the various Buy-Till-You-Die models, we first need to discuss the basic underlying theory and model.
In this model, we assume a customer becomes ‘alive’ to the business at the first purchase and then makes purchases stochastically but at a steady-rate for a period of time, and then ‘dies’ - i.e. becomes inactive to the business - hence the use of “Buy-Till-You-Die”.
Thus, at a high level these models decompose into modelling the transaction events using distributions such as the Poisson or Negative Binomial, and then modelling the ‘dropout’ process using some other method.
A number of BTYD models exist and for this workshop we will focus on the BG/NBD model - the Beta-Geometric Negative Binomial Distribution model (though we will discuss the P/NBD model also).
These models require only two pieces of information about each customer’s purchasing history: the “recency” (when the last transaction occurred) and “frequency” (the count of transactions made by that customer in a specified time period).
The notation used to represent this information is
\[ X = (x, \, t_x, \, T), \] where
\[ \begin{eqnarray*} x &=& \text{the number of transactions}, \\ T &=& \text{the observed time period}, \\ t_x &=& \text{the time since the last transaction}. \end{eqnarray*} \]
From this summary data we can fit most BTYD models.
2 BTYD Models
There are a number of different statistical approaches to building BTYD models - relying on a number of different assumptions about how the various recency, frequency and monetary values are modelled.
We now discuss a number of different ways of modelling this.
2.1 Pareto/Negative-Binomial Distribution (P/NBD) Model
The P/NBD model relies on five assumptions:
- While active, the number of transactions made by a customer follows a Poisson process with transaction rate \(\lambda\).
- Heterogeneity in \(\lambda\) follows a Gamma distribution \(\Gamma(\lambda \, | \, \alpha, r)\) with shape \(r\) and rate \(\alpha\).
- Each customer has an unobserved ‘lifetime’ of length \(\tau\). This point at which the customer becomes inactive is distributed as an exponential with dropout rate \(\mu\).
- Heterogeneity in dropout rates across customers follows a Gamma distribution \(\Gamma(\mu \, | \, s, \beta)\) with shape parameter \(s\) and rate parameter \(\beta\).
- The transaction rate \(\lambda\) and the dropout rate \(\mu\) vary independently across customers.
As before, we express this in mathematical notation as:
\[ \begin{eqnarray*} \lambda &\sim& \Gamma(\alpha, r), \\ \mu &\sim& \Gamma(s, \beta), \\ \tau &\sim& \text{Exponential}(\mu) \end{eqnarray*} \]
2.2 Beta-Geometric/Negative-Binomial Distribution (BG/NBD) Model
This model relies on a number of base assumptions, somewhat similar to the P/NBD model but modelling lifetime with a different process:
- While active, the number of transactions made by a customer follows a Poisson process with transaction rate \(\lambda\).
- Heterogeneity in \(\lambda\) follows a Gamma distribution \(\Gamma(\lambda \, | \, \alpha, r)\) with parameters shape \(r\) and rate \(\alpha\).
- After any transaction, a customer becomes inactive with probability \(p\).
- Heterogeneity in \(p\) follows a Beta distribution \(B(p \, | \, a, b)\) with shape parameters \(a\) and \(b\).
- The transaction rate \(\lambda\) and the dropout probability \(p\) vary independently across customers.
Note that it follows from the above assumptions that the probability of a customer being ‘alive’ after any transaction is given by the Geometric distribution, and hence the Beta-Geometric in the name.
To put this into more formal mathematical notation, we have:
\[ \begin{eqnarray*} \lambda &\sim& \Gamma(\alpha, r), \\ P(\text{alive}, k) &\sim& \text{Geometric}(p, k), \\ p &\sim& \text{Beta}(a, b) \end{eqnarray*} \]
3 Initial P/NBD Models
We start by modelling the P/NBD model using our synthetic datasets before we try to model real-life data.
Show code
use_fit_start_date <- as.Date("2020-01-01")
use_fit_end_date <- as.Date("2021-12-31")
use_valid_start_date <- as.Date("2022-01-01")
use_valid_end_date <- as.Date("2022-12-31")3.1 Load Short Time-frame Synthetic Data
We now want to load the short time-frame synthetic data.
Show code
customer_cohortdata_tbl <- read_rds("data/synthdata_shortframe_cohort_tbl.rds")
customer_cohortdata_tbl |> glimpse()Rows: 50,000
Columns: 4
$ customer_id <chr> "C202001_0001", "C202001_0002", "C202001_0003", "C20200…
$ cohort_qtr <chr> "2020 Q1", "2020 Q1", "2020 Q1", "2020 Q1", "2020 Q1", …
$ cohort_ym <chr> "2020 01", "2020 01", "2020 01", "2020 01", "2020 01", …
$ first_tnx_date <date> 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-0…
Show code
customer_simparams_tbl <- read_rds("data/synthdata_shortframe_simparams_tbl.rds")
customer_simparams_tbl |> glimpse()Rows: 50,000
Columns: 9
$ customer_id <chr> "C202001_0001", "C202001_0002", "C202001_0003", "C2020…
$ cohort_qtr <chr> "2020 Q1", "2020 Q1", "2020 Q1", "2020 Q1", "2020 Q1",…
$ cohort_ym <chr> "2020 01", "2020 01", "2020 01", "2020 01", "2020 01",…
$ first_tnx_date <date> 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-01, 2020-…
$ customer_lambda <dbl> 0.12213805, 0.29987747, 0.31504009, 0.03856001, 0.1881…
$ customer_mu <dbl> 0.127118566, 0.096184402, 0.052334526, 0.204708842, 0.…
$ customer_tau <dbl> 29.5956480, 10.9437199, 1.6938450, 2.6798108, 48.75206…
$ customer_amtmn <dbl> 46.2371662, 111.0425353, 45.8870891, 43.0754249, 10.93…
$ customer_amtcv <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
Show code
customer_transactions_tbl <- read_rds("data/synthdata_shortframe_transactions_tbl.rds")
customer_transactions_tbl |> glimpse()Rows: 297,980
Columns: 4
$ customer_id <chr> "C202001_0028", "C202001_0006", "C202001_0012", "C202001…
$ tnx_timestamp <dttm> 2020-01-01 00:59:15, 2020-01-01 01:25:43, 2020-01-01 04…
$ invoice_id <chr> "T20200101-0001", "T20200101-0002", "T20200101-0003", "T…
$ tnx_amount <dbl> 269.07, 124.62, 20.81, 14.76, 16.02, 19.16, 211.91, 43.9…
We re-produce the visualisation of the transaction times we used in previous workbooks.
Show code
plot_tbl <- customer_transactions_tbl |>
group_nest(customer_id, .key = "cust_data") |>
filter(map_int(cust_data, nrow) > 3) |>
slice_sample(n = 30) |>
unnest(cust_data)
ggplot(plot_tbl, aes(x = tnx_timestamp, y = customer_id)) +
geom_line() +
geom_point() +
labs(
x = "Date",
y = "Customer ID",
title = "Visualisation of Customer Transaction Times"
) +
theme(axis.text.y = element_text(size = 10))3.2 Derive the Log-likelihood Function
We now turn our attention to deriving the log-likelihood model for the P/NBD model.
We assume that we know that a given customer has made \(x\) transactions after the initial one over an observed period of time \(T\), and we label these transactions \(t_1\), \(t_2\), …, \(t_x\).
To model the likelihood for this observation, we need to consider two possibilities: one for where the customer is still ‘alive’ at \(T\), and one where the customer has ‘died’ by \(T\).
In the first instance, the likelihood is the product of the observations of each transaction, multiplied by the likelihood of the customer still being alive at time \(T\).
Because we are modelling the transaction counts as a Poisson process, this corresponds to the times between events following an exponential distribution, and so both the transaction times and the lifetime likelihoods use the exponential.
This gives us:
\[ \begin{eqnarray*} L(\lambda \, | \, t_1, t_2, ..., t_x, T, \, \tau > T) &=& \lambda e^{-\lambda t_1} \lambda e^{-\lambda(t_2 - t_1)} ... \lambda e^{-\lambda (t_x - t_{x-1})} e^{-\lambda(T - t)} \\ &=& \lambda^x e^{-\lambda T} \end{eqnarray*} \]
and we can combine this with the likelihood of the lifetime of the customer \(\tau\) being greater than the observation window \(T\),
\[ P(\tau > T \, | \, \mu) = e^{-\mu T} \]
For the second case, the customer becomes inactive at some time \(\tau\) in the interval \((t_x, T]\), and so the likelihood is
\[ \begin{eqnarray*} L(\lambda \, | \, t_1, t_2, ..., t_x, T, \, \tau > T) &=& \lambda e^{-\lambda t_1} \lambda e^{-\lambda(t_2 - t_1)} ... \lambda e^{-\lambda (t_x - t_{x-1})} e^{-\lambda(\tau - t_x)} \\ &=& \lambda^x e^{-\lambda \tau} \end{eqnarray*} \]
In both cases we do not need the times of the individual transactions, and all we need are the values \((x, t_x, T)\).
As we cannot observe \(\tau\), we want to remove the conditioning on \(\tau\) by integrating it out.
\[ \begin{eqnarray*} L(\lambda, \mu \, | \, x, t_x, T) &=& L(\lambda \, | \, t_1, t_2, ..., t_x, T, \, \tau > T) \, P(\tau > T \, | \, \mu) + \int^T_{t_x} L(\lambda \, | \, x, T, \text{ inactive at } (t_x, T] ) \, f(\tau \, | \mu) d\tau \\ &=& \lambda^x e^{-\lambda T} e^{\mu T} + \lambda^x \int^T_{t_x} e^{-\lambda \tau} \mu e^{-\mu \tau} d\tau \\ &=& \lambda^x e^{-(\lambda + \mu)T} + \frac{\lambda^x \mu}{\lambda + \mu} e^{-(\lambda + \mu) t_x} + \frac{\lambda^x \mu}{\lambda + \mu} e^{-(\lambda + \mu) T} \\ &=& \frac{\lambda^x \mu}{\lambda + \mu} e^{-(\lambda + \mu) t_x} + \frac{\lambda^{x+1} \mu}{\lambda + \mu} e^{-(\lambda + \mu) T} \end{eqnarray*} \]
In Stan, we do not calculate the likelihoods but the Log-likelihood, so we need to take the log of this expression. This creates a problem, as we have no easy way to calculate \(\log(a + b)\). As this expression occurs a lot, Stan provides a log_sum_exp(), which is defined by
\[ \log \, (a + b) = \text{log_sum_exp}(\log a, \log b) \]
\[ \begin{eqnarray*} LL(\lambda, \mu \, | \, x, t_x, T) &=& \log \left( \frac{\lambda^x \, \mu}{\lambda + \mu} \left( e^{-(\lambda + \mu) t_x} + \lambda e^{-(\lambda + \mu) T} \right) \right) \\ &=& x \log \lambda + \log \mu -log(\lambda + \mu) + \text{log_sum_exp}(-(\lambda + \mu) \, t_x, \; \log \lambda - (\lambda + \mu) \, T) \end{eqnarray*} \]
This is the log-likelihood model we want to fit in Stan.
3.3 Construct Datasets
Having loaded the synthetic data we need to construct a number of datasets of derived values.
Show code
customer_summarystats_tbl <- customer_transactions_tbl |>
calculate_transaction_cbs_data(last_date = use_fit_end_date)
customer_summarystats_tbl |> glimpse()Rows: 33,326
Columns: 6
$ customer_id <chr> "C202001_0001", "C202001_0002", "C202001_0003", "C20200…
$ first_tnx_date <dttm> 2020-01-01 17:02:14, 2020-01-01 15:19:11, 2020-01-01 0…
$ last_tnx_date <dttm> 2020-05-31 14:28:54, 2020-03-05 04:15:05, 2020-01-08 1…
$ x <dbl> 3, 3, 2, 0, 9, 0, 0, 1, 0, 2, 5, 1, 3, 1, 1, 71, 7, 12,…
$ t_x <dbl> 21.556217, 9.076975, 1.054482, 0.000000, 48.249081, 0.0…
$ T_cal <dbl> 104.1843, 104.1945, 104.2446, 104.2266, 104.1636, 104.2…
As before, we construct a number of subsets of the data for use later on with the modelling and create some data subsets.
Show code
shuffle_tbl <- customer_summarystats_tbl |>
slice_sample(prop = 1, replace = FALSE)
id_50 <- shuffle_tbl |> head(50) |> pull(customer_id) |> sort()
id_1000 <- shuffle_tbl |> head(1000) |> pull(customer_id) |> sort()
id_5000 <- shuffle_tbl |> head(5000) |> pull(customer_id) |> sort()
id_10000 <- shuffle_tbl |> head(10000) |> pull(customer_id) |> sort()We then construct some fit data based on these values.
Show code
fit_1000_data_tbl <- customer_summarystats_tbl |> filter(customer_id %in% id_1000)
fit_1000_data_tbl |> glimpse()Rows: 1,000
Columns: 6
$ customer_id <chr> "C202001_0025", "C202001_0084", "C202001_0091", "C20200…
$ first_tnx_date <dttm> 2020-01-01 07:41:56, 2020-01-02 04:04:36, 2020-01-03 0…
$ last_tnx_date <dttm> 2020-05-08 11:18:39, 2020-01-22 08:40:44, 2020-03-29 1…
$ x <dbl> 2, 1, 1, 10, 1, 0, 11, 0, 7, 1, 13, 0, 2, 0, 2, 4, 0, 4…
$ t_x <dbl> 18.3072133, 2.8845382, 12.3386007, 40.7424467, 0.491829…
$ T_cal <dbl> 104.2399, 104.1186, 103.9738, 103.9929, 103.9709, 103.6…
Show code
fit_10000_data_tbl <- customer_summarystats_tbl |> filter(customer_id %in% id_10000)
fit_10000_data_tbl |> glimpse()Rows: 10,000
Columns: 6
$ customer_id <chr> "C202001_0004", "C202001_0007", "C202001_0009", "C20200…
$ first_tnx_date <dttm> 2020-01-01 09:55:23, 2020-01-01 13:54:58, 2020-01-01 1…
$ last_tnx_date <dttm> 2020-01-01 09:55:23, 2020-01-01 13:54:58, 2020-01-01 1…
$ x <dbl> 0, 0, 0, 0, 30, 2, 0, 1, 0, 21, 0, 0, 30, 55, 0, 0, 8, …
$ t_x <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 61.5535…
$ T_cal <dbl> 104.2266, 104.2029, 104.1982, 104.2202, 104.2108, 104.2…
Finally, we also want to recreate our transaction visualisation for the first 50 customers randomly selected.
Show code
plot_tbl <- customer_transactions_tbl |>
filter(customer_id %in% id_50)
ggplot(plot_tbl, aes(x = tnx_timestamp, y = customer_id)) +
geom_line() +
geom_point() +
labs(
x = "Date",
y = "Customer ID",
title = "Visualisation of Customer Transaction Times"
) +
theme(axis.text.y = element_text(size = 10))3.4 Write Data
Show code
id_1000 |> write_rds("data/shortframe_id_1000.rds")
id_5000 |> write_rds("data/shortframe_id_5000.rds")
id_10000 |> write_rds("data/shortframe_id_10000.rds")
fit_1000_data_tbl |> write_rds("data/fit_1000_shortframe_data_tbl.rds")
fit_10000_data_tbl |> write_rds("data/fit_10000_shortframe_data_tbl.rds")
customer_summarystats_tbl |> write_rds("data/customer_summarystats_shortframe_tbl.rds")4 Fit Initial P/NBD Model
We now construct our Stan model and prepare to fit it with our synthetic dataset.
Before we start on that, we set a few parameters for the workbook to organise our Stan code.
Show code
stan_modeldir <- "stan_models"
stan_codedir <- "stan_code"We also want to set a number of overall parameters for this workbook
To start the fit data, we want to use the 1,000 customers.
Show code
use_fit_data_tbl <- fit_1000_data_tblWe start with the Stan model.
functions {
#include util_functions.stan
}
data {
int<lower=1> n; // number of customers
vector<lower=0>[n] t_x; // time to most recent purchase
vector<lower=0>[n] T_cal; // total observation time
vector<lower=0>[n] x; // number of purchases observed
real<lower=0> lambda_mn; // prior mean for lambda
real<lower=0> lambda_cv; // prior cv for lambda
real<lower=0> mu_mn; // prior mean for mu
real<lower=0> mu_cv; // prior mean for mu
}
transformed data {
real<lower=0> r = 1 / (lambda_cv * lambda_cv);
real<lower=0> alpha = 1 / (lambda_cv * lambda_cv * lambda_mn);
real<lower=0> s = 1 / (mu_cv * mu_cv);
real<lower=0> beta = 1 / (mu_cv * mu_cv * mu_mn);
}
parameters {
vector<lower=0>[n] lambda; // purchase rate
vector<lower=0>[n] mu; // lifetime dropout rate
}
model {
// setting priors
lambda ~ gamma(r, alpha);
mu ~ gamma(s, beta);
target += calculate_pnbd_loglik(n, lambda, mu, x, t_x, T_cal);
}
generated quantities {
vector[n] p_alive; // Probability that they are still "alive"
p_alive = 1 ./ (1 + mu ./ (mu + lambda) .* (exp((lambda + mu) .* (T_cal - t_x)) - 1));
}
This file contains a few new features of Stan - named file includes and user-defined functions - calculate_pnbd_loglik. We look at this file here:
real calculate_pnbd_loglik(int n, vector lambda, vector mu,
data vector x, data vector t_x, data vector T_cal) {
// likelihood
vector[n] t1;
vector[n] t2;
vector[n] lpm;
vector[n] lht;
vector[n] rht;
lpm = lambda + mu;
lht = log(lambda) - lpm .* T_cal;
rht = log(mu) - lpm .* t_x;
t1 = x .* log(lambda) - log(lpm);
for (i in 1:n) {
t2[i] = log_sum_exp(lht[i], rht[i]);
}
return(sum(t1) + sum(t2));
}
real calculate_bgnbd_loglik(int n, vector lambda, vector p,
data vector x, data vector t_x, data vector T_cal) {
// likelihood
vector[n] t1;
vector[n] t2;
vector[n] lht;
vector[n] rht;
lht = log(p) + (x-1) .* log(1-p) + x .* log(lambda) - lambda .* t_x;
rht = x .* log(1-p) + x .* log(lambda) - lambda .* T_cal;
for(i in 1:n) {
t2[i] = log_sum_exp(lht[i], rht[i]);
}
return(sum(t2));
}
We now compile this model using CmdStanR.
Show code
pnbd_fixed_stanmodel <- cmdstan_model(
"stan_code/pnbd_fixed.stan",
include_paths = stan_codedir,
pedantic = TRUE,
dir = stan_modeldir
)We then use this compiled model with our data to produce a fit of the data.
Show code
stan_modelname <- "pnbd_init_fixed"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- use_fit_data_tbl |>
select(customer_id, x, t_x, T_cal) |>
compose_data(
lambda_mn = 0.25,
lambda_cv = 1.00,
mu_mn = 0.10,
mu_cv = 1.00,
)
pnbd_fixed_stanfit <- pnbd_fixed_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4201,
save_warmup = TRUE,
output_dir = stan_modeldir,
output_basename = stanfit_prefix,
)Running MCMC with 4 chains, at most 8 in parallel...
Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 4 finished in 23.0 seconds.
Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 3 finished in 23.2 seconds.
Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 1 finished in 23.4 seconds.
Chain 2 finished in 23.4 seconds.
All 4 chains finished successfully.
Mean chain execution time: 23.2 seconds.
Total execution time: 23.5 seconds.
Show code
pnbd_fixed_stanfit$summary()# A tibble: 3,001 × 10
variable mean median sd mad q5 q95 rhat ess_b…¹
<chr> <num> <num> <num> <num> <num> <num> <num> <num>
1 lp__ -1.60e+4 -1.60e+4 36.4 37.3 -1.61e+4 -1.59e+4 1.01 573.
2 lambda[1] 1.05e-1 8.95e-2 0.0673 0.0598 2.36e-2 2.34e-1 1.00 3358.
3 lambda[2] 2.07e-1 1.58e-1 0.173 0.128 2.84e-2 5.45e-1 0.999 3424.
4 lambda[3] 8.98e-2 7.03e-2 0.0716 0.0581 1.29e-2 2.27e-1 1.00 3151.
5 lambda[4] 2.28e-1 2.19e-1 0.0700 0.0689 1.26e-1 3.58e-1 1.00 4251.
6 lambda[5] 3.03e-1 2.44e-1 0.243 0.194 4.27e-2 7.49e-1 1.00 3853.
7 lambda[6] 1.42e-1 7.89e-2 0.172 0.0908 5.26e-3 4.79e-1 1.00 3129.
8 lambda[7] 4.79e-1 4.65e-1 0.140 0.138 2.74e-1 7.33e-1 0.999 4655.
9 lambda[8] 1.46e-1 8.34e-2 0.186 0.0939 4.67e-3 5.03e-1 1.00 3150.
10 lambda[9] 1.26e-1 1.20e-1 0.0482 0.0463 5.84e-2 2.13e-1 1.00 3334.
# … with 2,991 more rows, 1 more variable: ess_tail <num>, and abbreviated
# variable name ¹ess_bulk
We have some basic HMC-based validity statistics we can check.
Show code
pnbd_fixed_stanfit$cmdstan_diagnose()Processing csv files: /home/rstudio/btydwork/stan_models/fit_pnbd_init_fixed-1.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_fixed-2.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_fixed-3.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_fixed-4.csvWarning: non-fatal error reading adaptation data
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
No divergent transitions found.
Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.
Effective sample size satisfactory.
Split R-hat values satisfactory all parameters.
Processing complete, no problems detected.
4.1 Visual Diagnostics of the Sample Validity
Now that we have a sample from the posterior distribution we need to create a few different visualisations of the diagnostics.
Show code
parameter_subset <- c(
"lambda[1]", "lambda[2]", "lambda[3]", "lambda[4]",
"mu[1]", "mu[2]", "mu[3]", "mu[4]"
)
pnbd_fixed_stanfit$draws(inc_warmup = TRUE) |>
mcmc_trace(
pars = parameter_subset,
n_warmup = 500
) +
ggtitle("Full Traceplots of Some Lambda and Mu Values")As the warmup is skewing the y-axis somewhat, we repeat this process without the warmup.
Show code
pnbd_fixed_stanfit$draws(inc_warmup = FALSE) |>
mcmc_trace(pars = parameter_subset) +
expand_limits(y = 0) +
labs(
x = "Iteration",
y = "Value",
title = "Traceplot of Sample of Lambda and Mu Values"
) +
theme(axis.text.x = element_text(size = 10))A common MCMC diagnostic is \(\hat{R}\) - which is a measure of the ‘similarity’ of the chains.
Show code
pnbd_fixed_stanfit |>
rhat(pars = c("lambda", "mu")) |>
mcmc_rhat() +
ggtitle("Plot of Parameter R-hat Values")Related to this quantity is the concept of effective sample size, \(N_{eff}\), an estimate of the size of the sample from a statistical information point of view.
Show code
pnbd_fixed_stanfit |>
neff_ratio(pars = c("lambda", "mu")) |>
as.numeric() |>
mcmc_neff() +
ggtitle("Plot of Parameter Effective Sample Sizes")Finally, we also want to look at autocorrelation in the chains for each parameter.
Show code
pnbd_fixed_stanfit$draws() |>
mcmc_acf(pars = parameter_subset) +
ggtitle("Autocorrelation Plot of Sample Values")As before, this first fit has a comprehensive run of fit diagnostics, but for the sake of brevity in later models we will show only the traceplots once we are satisfied with the validity of the sample.
4.2 Check Model Fit
As we are still working with synthetic data, we know the true values for each customer and so we can check how good our model is at recovering the true values on a customer-by-customer basis.
As in previous workbooks, we build our validation datasets and then check the distribution of \(q\)-values for both \(\lambda\) and \(\mu\) across the customer base.
Show code
pnbd_fixed_valid_lst <- create_pnbd_posterior_validation_data(
stanfit = pnbd_fixed_stanfit,
data_tbl = use_fit_data_tbl,
simparams_tbl = customer_simparams_tbl
)
pnbd_fixed_valid_lst$lambda_qval_plot |> plot()Show code
pnbd_fixed_valid_lst$mu_qval_plot |> plot()5 Fit Alternate Prior P/NBD Model
We now repeat all of the above but with an alternative set of priors and compare the outputs to give us an idea of the sensitivity of the inference to the input priors.
We will repeat this a few times, so we start by increasing the co-efficient of variation in the priors. We keep everything else constant, including the seed used by Stan.
Show code
stan_modelname <- "pnbd_init_fixed2"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- use_fit_data_tbl |>
select(customer_id, x, t_x, T_cal) |>
compose_data(
lambda_mn = 0.25,
lambda_cv = 2.00,
mu_mn = 0.10,
mu_cv = 2.00,
)
pnbd_fixed2_stanfit <- pnbd_fixed_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4201,
save_warmup = TRUE,
output_dir = stan_modeldir,
output_basename = stanfit_prefix,
)Running MCMC with 4 chains, at most 8 in parallel...
Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 1 finished in 52.7 seconds.
Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 3 finished in 52.9 seconds.
Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 4 finished in 53.6 seconds.
Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 2 finished in 54.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 53.3 seconds.
Total execution time: 54.1 seconds.
Show code
pnbd_fixed2_stanfit$summary()# A tibble: 3,001 × 10
variable mean median sd mad q5 q95 rhat ess_b…¹
<chr> <num> <num> <num> <num> <num> <num> <num> <num>
1 lp__ -1.15e+4 -1.15e+4 40.4 3.89e+1 -1.16e+4 -1.14e+4 1.00 487.
2 lambda[1] 7.39e-2 5.67e-2 0.0612 4.86e-2 1.00e-2 1.97e-1 1.00 1381.
3 lambda[2] 1.74e-1 1.06e-1 0.197 1.18e-1 6.17e-3 5.78e-1 1.00 1309.
4 lambda[3] 4.82e-2 2.70e-2 0.0585 3.07e-2 1.86e-3 1.69e-1 1.00 1462.
5 lambda[4] 2.22e-1 2.14e-1 0.0748 7.33e-2 1.14e-1 3.59e-1 1.00 3001.
6 lambda[5] 4.49e-1 2.75e-1 0.510 3.05e-1 1.45e-2 1.52e+0 1.00 1095.
7 lambda[6] 5.99e-2 2.49e-3 0.217 3.70e-3 6.70e-8 2.97e-1 1.00 1401.
8 lambda[7] 5.12e-1 4.96e-1 0.157 1.52e-1 2.89e-1 8.00e-1 1.00 2740.
9 lambda[8] 5.36e-2 3.22e-3 0.178 4.77e-3 2.01e-7 2.83e-1 1.00 866.
10 lambda[9] 1.10e-1 1.04e-1 0.0470 4.70e-2 4.42e-2 1.95e-1 1.00 1728.
# … with 2,991 more rows, 1 more variable: ess_tail <num>, and abbreviated
# variable name ¹ess_bulk
We have some basic HMC-based validity statistics we can check.
Show code
pnbd_fixed2_stanfit$cmdstan_diagnose()Processing csv files: /home/rstudio/btydwork/stan_models/fit_pnbd_init_fixed2-1.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_fixed2-2.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_fixed2-3.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_fixed2-4.csvWarning: non-fatal error reading adaptation data
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
No divergent transitions found.
Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.
Effective sample size satisfactory.
The following parameters had split R-hat greater than 1.05:
p_alive[56], p_alive[90], p_alive[115], p_alive[289], p_alive[399], p_alive[622]
Such high values indicate incomplete mixing and biased estimation.
You should consider regularizating your model with additional prior information or a more effective parameterization.
Processing complete.
5.1 Visual Diagnostics of the Sample Validity
We do not repeat the full set of validation checks here, but look at the plot of effective stepsizes.
Show code
pnbd_fixed2_stanfit |>
neff_ratio(pars = c("lambda", "mu")) |>
as.numeric() |>
mcmc_neff() +
ggtitle("Plot of Parameter Effective Sample Sizes for Alternate Priors")5.2 Check Model Fit
As we are still working with synthetic data, we know the true values for each customer and so we can check how good our model is at recovering the true values on a customer-by-customer basis.
As in previous workbooks, we build our validation datasets and then check the distribution of \(q\)-values for both \(\lambda\) and \(\mu\) across the customer base.
Show code
pnbd_fixed2_valid_lst <- create_pnbd_posterior_validation_data(
stanfit = pnbd_fixed2_stanfit,
data_tbl = use_fit_data_tbl,
simparams_tbl = customer_simparams_tbl
)
pnbd_fixed2_valid_lst$lambda_qval_plot |> plot()Show code
pnbd_fixed2_valid_lst$mu_qval_plot |> plot()We now construct some error-bar plots for a subset of the customers to get an idea of the differences across the two posteriors.
Show code
comparison_plot_tbl <- list(
fixed = pnbd_fixed_valid_lst$validation_tbl,
fixed2 = pnbd_fixed2_valid_lst$validation_tbl
) |>
bind_rows(.id = "post_label") |>
group_by(post_label, customer_id) |>
summarise(
.groups = "drop",
p10 = quantile(post_lambda, 0.10),
p25 = quantile(post_lambda, 0.25),
p50 = quantile(post_lambda, 0.50),
p75 = quantile(post_lambda, 0.75),
p90 = quantile(post_lambda, 0.90),
customer_lambda = customer_lambda |> unique()
) |>
group_nest(customer_id) |>
slice_sample(n = 50) |>
unnest(data)
ggplot(comparison_plot_tbl) +
geom_point(aes(x = customer_id, y = customer_lambda)) +
geom_errorbar(
aes(x = customer_id, ymin = p25, ymax = p75, colour = post_label),
position = position_dodge(width = 0.75), width = 0, size = 3,
) +
geom_errorbar(
aes(x = customer_id, ymin = p10, ymax = p90, colour = post_label),
position = position_dodge(width = 0.75), width = 0, size = 1,
) +
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5)
)6 Assessing the P/NBD Models
We now focus on assess the various versions of our models without the benefit of knowing the ‘true’ answer as we will not have this information in our real-world applications.
A key derived quantity to employ is \(p_{alive}(T)\), the probability that the customer is still ‘alive’ at the point of observation of the data.
6.1 Calculating p_alive
The probability that a customer with purchase history \((x, t_x, T)\) is ‘alive’ at time \(T\) is the probability that the (unobserved) time at which he becomes inactive (\(\tau\)) occurs after T, that is \(P(\tau > T)\).
We apply Bayes’ Theorem to give us:
\[ \begin{eqnarray*} P(\tau > T \, | \, \lambda, \mu, x, t_x, T) &=& \frac{L(\lambda \, | \, x, T, \tau > T) \, P(\tau > T \, | \, \mu)} {L(\lambda, \mu \, | \, x, t_x, T)} \\ &=& \frac{\lambda^x e^{−(λ+μ)T}} {L(\lambda, \mu \, | \, x, t_x, T)} \end{eqnarray*} \]
We know the likelihood for this model, so can now substitute this in:
\[ \begin{eqnarray*} P(\tau > T \, | \, \lambda, \mu, x, t_x, T) &=& \frac{\lambda^x e^{−(λ+μ)T}} {\frac{\lambda^x \mu}{\lambda + \mu} e^{-(\lambda + \mu) t_x} + \frac{\lambda^{x+1} \mu}{\lambda + \mu} e^{-(\lambda + \mu) T}} \\ &=& \frac{\lambda^x \, e^{-(\lambda + \mu)T}} {\lambda^x e^{-(\lambda + \mu)T} \{1 + (\frac{u}{\lambda + \mu}) [e^{-(\lambda + \mu)(t_x - T)} - 1 ]\}} \\ &=& \frac{1}{1 + (\frac{u}{\lambda + \mu}) [e^{(\lambda + \mu)(T - t_x)} - 1 ]} \end{eqnarray*} \]
We now want to verify this calculation in our posterior estimates. To do this, we take a number of different cohorts of customers, and visually inspect our transaction visualisation.
Show code
pnbd_fixed_palive_summary_tbl <- pnbd_fixed_valid_lst$validation_tbl |>
group_by(customer_id) |>
summarise(
.groups = "drop",
p_alive_p10 = quantile(p_alive, 0.10),
p_alive_p25 = quantile(p_alive, 0.25),
p_alive_p50 = quantile(p_alive, 0.50),
p_alive_p75 = quantile(p_alive, 0.75),
p_alive_p90 = quantile(p_alive, 0.90),
p_alive_range50 = p_alive_p75 - p_alive_p25,
p_alive_range80 = p_alive_p90 - p_alive_p10,
)
pnbd_fixed_palive_summary_tbl |> glimpse()Rows: 1,000
Columns: 8
$ customer_id <chr> "C202001_0025", "C202001_0084", "C202001_0091", "C2020…
$ p_alive_p10 <dbl> 2.030910e-10, 1.126283e-26, 1.599320e-11, 1.144730e-09…
$ p_alive_p25 <dbl> 5.606343e-08, 4.755042e-19, 1.066430e-08, 2.955800e-08…
$ p_alive_p50 <dbl> 8.077040e-06, 1.022550e-12, 6.479560e-06, 7.548325e-07…
$ p_alive_p75 <dbl> 4.621550e-04, 3.071232e-08, 8.785598e-04, 1.517572e-05…
$ p_alive_p90 <dbl> 9.452371e-03, 2.280342e-05, 1.744547e-02, 1.499709e-04…
$ p_alive_range50 <dbl> 4.620989e-04, 3.071232e-08, 8.785491e-04, 1.514617e-05…
$ p_alive_range80 <dbl> 9.452371e-03, 2.280342e-05, 1.744547e-02, 1.499698e-04…
We now take the customers that are highly likely to no longer be active, and highly likely to be active, and so we use the 10% and 90% percentiles.
Show code
likely_active_id <- pnbd_fixed_palive_summary_tbl |>
filter(p_alive_p10 > 0.95) |>
pull(customer_id)
plot_tbl <- customer_transactions_tbl |>
filter(
tnx_timestamp < as.Date("2022-01-01"),
customer_id %in% likely_active_id
)
ggplot(plot_tbl, aes(x = tnx_timestamp, y = customer_id)) +
geom_line() +
geom_point() +
geom_vline(aes(xintercept = as.POSIXct("2019-01-01")), colour = "red") +
labs(
x = "Date",
y = "Customer ID",
title = "Visualisation of Transaction Times for Likely Active Customers"
) +
theme(axis.text.y = element_text(size = 10))This is useful, but the longer period of time prevents us from seeing the more recent transactions clearly, so we focus on the final year.
Show code
likely_active_id <- pnbd_fixed_palive_summary_tbl |>
filter(p_alive_p10 > 0.95) |>
pull(customer_id)
plot_tbl <- customer_transactions_tbl |>
filter(
tnx_timestamp < as.Date("2022-01-01"),
tnx_timestamp >= as.Date("2021-01-01"),
customer_id %in% likely_active_id
)
ggplot(plot_tbl, aes(x = tnx_timestamp, y = customer_id)) +
geom_line() +
geom_point() +
geom_vline(aes(xintercept = as.POSIXct("2022-01-01")), colour = "red") +
labs(
x = "Date",
y = "Customer ID",
title = "Visualisation of Transaction Times for Likely Active Customers"
) +
theme(axis.text.y = element_text(size = 10))We see that most of the active customers have reasonably recent transactions.
Show code
likely_inactive_id <- pnbd_fixed_palive_summary_tbl |>
filter(p_alive_p90 < 0.05) |>
pull(customer_id)
plot_tbl <- customer_transactions_tbl |>
filter(
tnx_timestamp < as.Date("2022-01-01"),
customer_id %in% likely_inactive_id
)
ggplot(plot_tbl, aes(x = tnx_timestamp, y = customer_id)) +
geom_line(alpha = 0.1) +
geom_point(alpha = 0.1, size = 1) +
geom_vline(aes(xintercept = as.POSIXct("2022-01-01")), colour = "red") +
labs(
x = "Date",
y = "Customer",
title = "Visualisation of Transaction Times for Likely Inactive Customers"
) +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank()
)Finally, we want to look at the customers with the most uncertainty in their value for p_alive. There are a number of ways to do this, but we start by looking at the range in values.
First we look at the distribution of these ranges.
Show code
ggplot(pnbd_fixed_palive_summary_tbl) +
geom_histogram(aes(x = p_alive_range80), binwidth = 0.02) +
labs(
x = "80% Interval Range",
y = "Frequency",
title = "Distribution of Range of Values for the 80% Credibility Range for p_alive"
)We see that most customers have their credibility intervals in a reasonably tight range, and so we can just look at customers where this range is at least 0.3.
Show code
plot_tbl <- pnbd_fixed_palive_summary_tbl |>
filter(p_alive_range80 > 0.3) |>
select(customer_id, p_alive_range80) |>
inner_join(customer_transactions_tbl, by = "customer_id") |>
filter(
tnx_timestamp < as.Date("2022-01-01"),
tnx_timestamp >= as.Date("2021-01-01")
)
ggplot(plot_tbl, aes(x = tnx_timestamp, y = customer_id, colour = p_alive_range80)) +
geom_line(alpha = 0.5) +
geom_point(alpha = 0.5, size = 1) +
geom_vline(aes(xintercept = as.POSIXct("2022-01-01")), colour = "red") +
scale_colour_gradient(low = "blue", high = "red") +
labs(
x = "Date",
y = "Customer",
colour = "Interval Range",
title = "Visualisation of Transaction Times for Non-Confident p_alive Customers"
) +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank()
)6.2 Constructing Model Validation Approaches.
We now can combine each of these three inferred parameters, \(\lambda\), \(\mu\), p_alive as an input to data generating simulations and compare the observed data to our simulated data.
Recall that our models were fit on data that occurred before a specific date: for this data the cutoff date is 2018-12-31.
We can approach this generative process in a number of different ways, but the most natural appears to be the following (repeated for each customer in the training dataset):
- Using
p_alive, simulate if the customer is still active. - If active, simulate \(\tau'\), the remaining lifetime from observed \(T\) using \(\mu\) – \(\tau' \sim \text{Exponential}(\mu)\).
- Simulate times \(t_x'\) between events from \(T\), \(t_x' \sim \text{Exponential}(\lambda)\).
- Keep all values where the cumulative sum of these time intervals is less than \(\tau\).
Note that the above approach is simplified due to the ‘memoryless’ nature of the Exponential distribution. Also, we could just simulate event counts using \(x' \sim \text{Poisson}(\lambda \tau')\) but it is more useful to also simulate time intervals between the transactions.
To perform this simulation, we write a function generate_pnbd_validation_transactions() to perform the required calculations. Thinking ahead, this function will also include parameters for the transaction amount, but for these models we set both those values to 1.
Show code
plan(multisession)
fit_label <- "pnbd_init_fixed"
precompute_dir <- glue("precompute/{fit_label}")
ensure_exists_precompute_directory(precompute_dir)[1] TRUE
Show code
pnbd_fixed_validsims_tbl <- pnbd_fixed_valid_lst$validation_tbl |>
group_nest(customer_id, .key = "cust_params") |>
mutate(
sim_file = glue(
"{precompute_dir}/sims_{fit_label}_{customer_id}.rds"
),
chunk_data = future_map2(
sim_file, cust_params,
run_pnbd_simulations_chunk,
start_dttm = as.POSIXct(use_valid_start_date),
end_dttm = as.POSIXct(use_valid_end_date),
.options = furrr_options(
globals = c(
"rbernoulli", "calculate_event_times", "rgamma_mucv",
"gamma_mucv2shaperate", "generate_pnbd_validation_transactions"
),
packages = c("tidyverse", "fs"),
scheduling = Inf,
seed = 421
),
.progress = TRUE
)
) |>
select(-cust_params) |>
unnest(chunk_data)
pnbd_fixed_validsims_tbl |> glimpse()Rows: 1,000
Columns: 3
$ customer_id <chr> "C202001_0025", "C202001_0084", "C202001_0091", "C202001_0…
$ sim_file <glue> "precompute/pnbd_init_fixed/sims_pnbd_init_fixed_C202001_…
$ chunk_data <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
We also want the summary statistics for the transactions in the final year that is not included in the fitted data.
Show code
obs_2022_stats_tbl <- customer_transactions_tbl |>
filter(
tnx_timestamp >= as.POSIXct("2022-01-01")
) |>
group_by(customer_id) |>
summarise(
.groups = "drop",
tnx_count = n(),
first_tnx = min(tnx_timestamp),
last_tnx = max(tnx_timestamp)
)
obs_2022_stats_tbl |> glimpse()Rows: 22,614
Columns: 4
$ customer_id <chr> "C202001_0016", "C202001_0040", "C202001_0043", "C202001_0…
$ tnx_count <int> 38, 10, 27, 4, 27, 2, 2, 4, 24, 18, 11, 9, 4, 8, 6, 4, 7, …
$ first_tnx <dttm> 2022-01-05 02:02:07, 2022-01-13 14:03:39, 2022-01-03 05:2…
$ last_tnx <dttm> 2022-12-28 21:49:13, 2022-06-29 20:33:29, 2022-09-27 02:4…
We now want to combine this data to start our investigation of the model outputs. We then compare the distribution of simulated transaction counts for our posterior against the observed count in the data by calculating the \(q\)-value.
Show code
pnbd_fixed_1000_validation_tbl <- pnbd_fixed_validsims_tbl |>
group_by(customer_id) |>
summarise(
.groups = "drop",
sim_count = list(map_int(sim_file, ~ .x |> read_rds() |> nrow()))
) |>
left_join(obs_2022_stats_tbl, by = "customer_id") |>
replace_na(list(tnx_count = 0)) |>
mutate(
tnx_count_qval = map2_dbl(sim_count, tnx_count, ~ ecdf(.x)(.y))
)
pnbd_fixed_1000_validation_tbl |> glimpse()Rows: 1,000
Columns: 6
$ customer_id <chr> "C202001_0025", "C202001_0084", "C202001_0091", "C20200…
$ sim_count <list> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, …
$ tnx_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0…
$ first_tnx <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2022-03-27 02:…
$ last_tnx <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2022-12-01 09:…
$ tnx_count_qval <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
We first look at the distribution of these \(q\)-values.
Show code
ggplot(pnbd_fixed_1000_validation_tbl) +
geom_histogram(aes(x = tnx_count_qval), bins = 50) +
labs(
x = "q-Value for Transaction Count",
y = "Frequency",
title = "Histogram of Transaction Count ECDF q-Values"
)q-values of 1.0 are inflated in this data as we have a number of customers that are marked to be highly likely to be inactive and so have a posterior distribution of all zero count transactions and observed no transactions in the validation set, so we can remove these data points.
Show code
pnbd_fixed_1000_validation_tbl |>
filter(
!(tnx_count == 0 & (tnx_count_qval == 1))
) |>
filter(tnx_count == 0) |>
glimpse()Rows: 822
Columns: 6
$ customer_id <chr> "C202001_0025", "C202001_0084", "C202001_0091", "C20200…
$ sim_count <list> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, …
$ tnx_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ first_tnx <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ last_tnx <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ tnx_count_qval <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
From inspection on this, we see there are a number of customers that have no observed transactions but also have a \(q\)-value of less than 1. In many cases this is fine, but we also see from the data there are a number of customers that have values close to 1 suggestion the simulations are almost all zero but also have a small count of simulations with at least one transaction.
This needs some further exploration, so we compare these customers to our summary data on the posterior distributions.
Show code
pnbd_fixed_1000_validation_tbl |>
filter(
!(tnx_count == 0 & (tnx_count_qval == 1))
) |>
filter(tnx_count == 0) |>
inner_join(pnbd_fixed_palive_summary_tbl, by = "customer_id") |>
glimpse()Rows: 822
Columns: 13
$ customer_id <chr> "C202001_0025", "C202001_0084", "C202001_0091", "C2020…
$ sim_count <list> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,…
$ tnx_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ first_tnx <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ last_tnx <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ tnx_count_qval <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ p_alive_p10 <dbl> 2.030910e-10, 1.126283e-26, 1.599320e-11, 1.144730e-09…
$ p_alive_p25 <dbl> 5.606343e-08, 4.755042e-19, 1.066430e-08, 2.955800e-08…
$ p_alive_p50 <dbl> 8.077040e-06, 1.022550e-12, 6.479560e-06, 7.548325e-07…
$ p_alive_p75 <dbl> 4.621550e-04, 3.071232e-08, 8.785598e-04, 1.517572e-05…
$ p_alive_p90 <dbl> 9.452371e-03, 2.280342e-05, 1.744547e-02, 1.499709e-04…
$ p_alive_range50 <dbl> 4.620989e-04, 3.071232e-08, 8.785491e-04, 1.514617e-05…
$ p_alive_range80 <dbl> 9.452371e-03, 2.280342e-05, 1.744547e-02, 1.499698e-04…
Show code
check_id <- pnbd_fixed_1000_validation_tbl |>
filter(
!(tnx_count == 0 & (tnx_count_qval == 1))
) |>
filter(tnx_count == 0) |>
slice(1) |>
pull(customer_id)Looking at this data identifies some unusual aspects to how customers with values of \(p_alive\) that are essentially zero are still producing non-zero transaction counts in our simulation, so we need to look at this.
In particular, we look at the first customer in this summary dataset, customer print(check_id).
Show code
pnbd_fixed_validsims_tbl |>
filter(
customer_id == check_id
) |>
mutate(data = map(sim_file, read_rds)) |>
unnest(data) |>
select(customer_id, post_lambda, post_mu, p_alive) |>
arrange(desc(p_alive)) |>
glimpse()Rows: 2,000
Columns: 4
$ customer_id <chr> "C202001_0025", "C202001_0025", "C202001_0025", "C202001_0…
$ post_lambda <dbl> 0.01297330, 0.02220370, 0.01123920, 0.01493750, 0.04207110…
$ post_mu <dbl> 3.90604e-05, 6.72488e-05, 9.30786e-04, 2.36233e-03, 6.4926…
$ p_alive <dbl> 0.993856, 0.982850, 0.876301, 0.681524, 0.632113, 0.611881…
We see these values come from a very low value of both \(\lambda\) and \(\mu\), so also want to see the values of the posterior that are more typical of the customer by looking at values around the median of p_alive.
Show code
pnbd_fixed_validsims_tbl |>
filter(
customer_id == check_id
) |>
mutate(data = map(sim_file, read_rds)) |>
unnest(data) |>
select(customer_id, draw_id, post_lambda, post_mu, p_alive) |>
filter(
abs(1 - (p_alive / median(p_alive))) < 0.2
) |>
arrange(draw_id) |>
glimpse()Rows: 54
Columns: 5
$ customer_id <chr> "C202001_0025", "C202001_0025", "C202001_0025", "C202001_0…
$ draw_id <int> 22, 98, 117, 140, 165, 166, 452, 469, 562, 602, 627, 718, …
$ post_lambda <dbl> 0.1441230, 0.0819646, 0.0991096, 0.0774526, 0.0872856, 0.0…
$ post_mu <dbl> 0.01745440, 0.06276900, 0.05043940, 0.06769770, 0.05839630…
$ p_alive <dbl> 8.63728e-06, 9.14855e-06, 7.77731e-06, 8.20773e-06, 9.1233…
We also want to look at all the transactions for that customer.
Show code
customer_transactions_tbl |>
filter(customer_id == check_id)# A tibble: 3 × 4
customer_id tnx_timestamp invoice_id tnx_amount
<chr> <dttm> <chr> <dbl>
1 C202001_0025 2020-01-01 07:41:56 T20200101-0007 212.
2 C202001_0025 2020-01-11 08:11:53 T20200111-0020 22.6
3 C202001_0025 2020-05-08 11:18:39 T20200508-0068 77.0
Finally, we want to check the posterior distribution for the alternate prior model and see what this looks like.
Show code
pnbd_fixed2_valid_lst$validation_tbl |>
filter(customer_id == check_id) |>
select(customer_id, draw_id, post_lambda, post_mu, p_alive) |>
arrange(desc(p_alive))# A tibble: 2,000 × 5
customer_id draw_id post_lambda post_mu p_alive
<chr> <int> <dbl> <dbl> <dbl>
1 C202001_0025 1240 0.0364 1.89e-10 1
2 C202001_0025 1241 0.0201 6.04e-10 1
3 C202001_0025 1242 0.0302 1.10e-10 1
4 C202001_0025 1243 0.0166 1.22e-10 1
5 C202001_0025 1244 0.0170 1.43e- 9 1
6 C202001_0025 1246 0.00910 2.73e-10 1
7 C202001_0025 1247 0.0123 2.71e- 9 1
8 C202001_0025 1536 0.0234 1.80e- 9 1
9 C202001_0025 1582 0.0378 2.11e-10 1
10 C202001_0025 1846 0.00800 1.29e- 9 1
# … with 1,990 more rows
We see this gives us silly values at least for low transaction count customers, especially for those customers with no extra transactions after the initial one.
That said, these outputs are useful
Show code
fit_label <- "pnbd_init_fixed2"
precompute_dir <- glue("precompute/{fit_label}")
ensure_exists_precompute_directory(precompute_dir)[1] TRUE
Show code
pnbd_fixed2_validsims_tbl <- pnbd_fixed2_valid_lst$validation_tbl |>
group_nest(customer_id, .key = "cust_params") |>
mutate(
sim_file = glue(
"{precompute_dir}/sims_{fit_label}_{customer_id}.rds"
),
chunk_data = future_map2(
sim_file, cust_params,
run_pnbd_simulations_chunk,
start_dttm = as.POSIXct(use_valid_start_date),
end_dttm = as.POSIXct(use_valid_end_date),
.options = furrr_options(
globals = c(
"calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
"generate_pnbd_validation_transactions"
),
packages = c("tidyverse", "fs"),
scheduling = FALSE,
seed = 421
),
.progress = TRUE
)
) |>
select(-cust_params) |>
unnest(chunk_data)
pnbd_fixed2_validsims_tbl |> glimpse()Rows: 1,000
Columns: 3
$ customer_id <chr> "C202001_0025", "C202001_0084", "C202001_0091", "C202001_0…
$ sim_file <glue> "precompute/pnbd_init_fixed2/sims_pnbd_init_fixed2_C20200…
$ chunk_data <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
6.3 Comparing Simulation Data
Before we move on to other aspects of model validation, we also want to check the simulation of transaction counts against the observed counts.
6.3.1 Assessing the pnbd_fixed Model
We first want to look at the total transaction count and count of customers transacted, comparing the observed amount to that observed in the actual data.
Show code
tnx_data_tbl <- obs_2022_stats_tbl |>
semi_join(pnbd_fixed_validsims_tbl, by = "customer_id")
obs_customer_count <- tnx_data_tbl |> nrow()
obs_total_tnx_count <- tnx_data_tbl |> pull(tnx_count) |> sum()
pnbd_fixed_tnx_simsumm_tbl <- pnbd_fixed_validsims_tbl |>
mutate(data = map(sim_file, read_rds)) |>
unnest(data) |>
group_by(draw_id) |>
summarise(
.groups = "drop",
sim_customer_count = length(sim_tnx_count[sim_tnx_count > 0]),
sim_total_tnx_count = sum(sim_tnx_count)
)
ggplot(pnbd_fixed_tnx_simsumm_tbl) +
geom_histogram(aes(x = sim_customer_count), binwidth = 1) +
geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
labs(
x = "Simulated Customers With Transactions",
y = "Frequency",
title = "Histogram of Count of Customers Transacted",
subtitle = "Observed Count in Red"
)Show code
ggplot(pnbd_fixed_tnx_simsumm_tbl) +
geom_histogram(aes(x = sim_total_tnx_count), binwidth = 5) +
geom_vline(aes(xintercept = obs_total_tnx_count), colour = "red") +
labs(
x = "Simulated Transaction Count",
y = "Frequency",
title = "Histogram of Count of Total Transaction Count",
subtitle = "Observed Count in Red"
)6.3.2 Assessing the pnbd_fixed2 Model
We repeat this assessment for our pnbd_fixed2 model, which had a wider coefficient of variation in our prior parameters.
Show code
tnx_data_tbl <- obs_2022_stats_tbl |>
semi_join(pnbd_fixed2_validsims_tbl, by = "customer_id")
obs_customer_count <- tnx_data_tbl |> nrow()
obs_total_tnx_count <- tnx_data_tbl |> pull(tnx_count) |> sum()
pnbd_fixed2_tnx_simsumm_tbl <- pnbd_fixed2_validsims_tbl |>
mutate(data = map(sim_file, read_rds)) |>
unnest(data) |>
group_by(draw_id) |>
summarise(
.groups = "drop",
sim_customer_count = length(sim_tnx_count[sim_tnx_count > 0]),
sim_total_tnx_count = sum(sim_tnx_count)
)
ggplot(pnbd_fixed2_tnx_simsumm_tbl) +
geom_histogram(aes(x = sim_customer_count), binwidth = 1) +
geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
labs(
x = "Simulated Customers With Transactions",
y = "Frequency",
title = "Histogram of Count of Customers Transacted",
subtitle = "Observed Count in Red"
)Show code
ggplot(pnbd_fixed2_tnx_simsumm_tbl) +
geom_histogram(aes(x = sim_total_tnx_count), binwidth = 5) +
geom_vline(aes(xintercept = obs_total_tnx_count), colour = "red") +
labs(
x = "Simulated Transaction Count",
y = "Frequency",
title = "Histogram of Count of Total Transaction Count",
subtitle = "Observed Count in Red"
)We see that in general, our models are over-estimating the count of customers transacting in the following year, as well as the total count of transactions made.
This suggests we should try an alternative prior, as our priors on both lifetime and transaction rate may skew too high.
7 Fitting Lower-CV Prior Model
Our existing fits have a number of different issues, so it is worth trying to refit this model with a lower coefficient of variation – this should solve the lower end of the distribution though the tradeoff means that that the right tail of the distribution is less pronounced. Given we are skewing high in our transaction counts though, this may not be a problem.
Show code
stan_modelname <- "pnbd_init_fixed3"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- use_fit_data_tbl |>
select(customer_id, x, t_x, T_cal) |>
compose_data(
lambda_mn = 0.25,
lambda_cv = 0.60,
mu_mn = 0.10,
mu_cv = 0.60,
)
pnbd_fixed3_stanfit <- pnbd_fixed_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4201,
save_warmup = TRUE,
output_dir = stan_modeldir,
output_basename = stanfit_prefix,
)Running MCMC with 4 chains, at most 8 in parallel...
Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 1 finished in 20.1 seconds.
Chain 4 finished in 19.8 seconds.
Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
Chain 2 finished in 20.1 seconds.
Chain 3 finished in 20.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 20.0 seconds.
Total execution time: 20.4 seconds.
Show code
pnbd_fixed3_stanfit$summary()# A tibble: 3,001 × 10
variable mean median sd mad q5 q95 rhat ess_b…¹
<chr> <num> <num> <num> <num> <num> <num> <num> <num>
1 lp__ -26448. -2.64e+4 33.0 32.6 -2.65e+4 -2.64e+4 1.01 530.
2 lambda[1] 0.140 1.28e-1 0.0672 0.0649 5.14e-2 2.65e-1 1.00 4219.
3 lambda[2] 0.222 1.98e-1 0.119 0.114 7.20e-2 4.53e-1 1.00 3329.
4 lambda[3] 0.135 1.22e-1 0.0713 0.0643 4.05e-2 2.71e-1 1.00 3503.
5 lambda[4] 0.230 2.22e-1 0.0670 0.0643 1.31e-1 3.47e-1 1.00 5338.
6 lambda[5] 0.273 2.45e-1 0.152 0.138 8.43e-2 5.62e-1 1.00 4218.
7 lambda[6] 0.194 1.69e-1 0.123 0.109 4.48e-2 4.32e-1 1.00 3063.
8 lambda[7] 0.425 4.15e-1 0.122 0.118 2.44e-1 6.44e-1 1.00 4840.
9 lambda[8] 0.195 1.73e-1 0.117 0.102 5.16e-2 4.28e-1 1.00 2692.
10 lambda[9] 0.141 1.36e-1 0.0463 0.0457 7.13e-2 2.25e-1 1.00 3671.
# … with 2,991 more rows, 1 more variable: ess_tail <num>, and abbreviated
# variable name ¹ess_bulk
We have some basic HMC-based validity statistics we can check.
Show code
pnbd_fixed3_stanfit$cmdstan_diagnose()Processing csv files: /home/rstudio/btydwork/stan_models/fit_pnbd_init_fixed3-1.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_fixed3-2.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_fixed3-3.csvWarning: non-fatal error reading adaptation data
, /home/rstudio/btydwork/stan_models/fit_pnbd_init_fixed3-4.csvWarning: non-fatal error reading adaptation data
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
No divergent transitions found.
Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.
Effective sample size satisfactory.
Split R-hat values satisfactory all parameters.
Processing complete, no problems detected.
7.1 Visual Diagnostics of the Sample Validity
We do not repeat the full set of validation checks here, but look at the plot of the traces, effective stepsizes, and the autocorrelation.
Show code
pnbd_fixed3_stanfit$draws() |>
mcmc_trace(pars = parameter_subset) +
ggtitle("Traceplot of Sample Parameters")Show code
pnbd_fixed3_stanfit |>
neff_ratio(pars = c("lambda", "mu")) |>
as.numeric() |>
mcmc_neff() +
ggtitle("Plot of Parameter Effective Sample Sizes for Low-CV Priors")Show code
pnbd_fixed3_stanfit$draws() |>
mcmc_acf(pars = parameter_subset) +
ggtitle("Autocorrelation Plot of Sample Parameters")7.2 Comparing the Model Validation
We now construct our posterior dataset.
Show code
pnbd_fixed3_validation_tbl <- pnbd_fixed3_stanfit |>
recover_types(use_fit_data_tbl) |>
spread_draws(lambda[customer_id], mu[customer_id], p_alive[customer_id]) |>
ungroup() |>
select(customer_id, draw_id = .draw, post_lambda = lambda, post_mu = mu, p_alive)
pnbd_fixed3_validation_tbl |> glimpse()Rows: 2,000,000
Columns: 5
$ customer_id <chr> "C202001_0025", "C202001_0025", "C202001_0025", "C202001_0…
$ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ post_lambda <dbl> 0.0975828, 0.1985570, 0.1310230, 0.2234610, 0.0694294, 0.1…
$ post_mu <dbl> 0.0434344, 0.0623800, 0.0347672, 0.0780704, 0.0653311, 0.0…
$ p_alive <dbl> 1.77280e-05, 7.64335e-10, 3.09798e-06, 2.15607e-11, 1.9282…
We now want to check that customer from before to see if the posterior has any anomalous values.
Show code
pnbd_fixed3_validation_tbl |>
filter(customer_id == "C201212_0330") |>
select(customer_id, draw_id, post_lambda, post_mu, p_alive) |>
arrange(desc(p_alive))# A tibble: 0 × 5
# … with 5 variables: customer_id <chr>, draw_id <int>, post_lambda <dbl>,
# post_mu <dbl>, p_alive <dbl>
The customer has a very low posterior value for p_alive.
7.3 Generate Model Validation Simulations
We now want to generate our validation simulations using our existing functions, like we did for previous models.
Show code
fit_label <- "pnbd_init_fixed3"
precompute_dir <- glue("precompute/{fit_label}")
ensure_exists_precompute_directory(precompute_dir)[1] TRUE
Show code
pnbd_fixed3_validsims_tbl <- pnbd_fixed3_validation_tbl |>
group_nest(customer_id, .key = "cust_params") |>
mutate(
sim_file = glue(
"{precompute_dir}/sims_{fit_label}_{customer_id}.rds"
),
chunk_data = future_map2(
sim_file, cust_params,
run_pnbd_simulations_chunk,
start_dttm = as.POSIXct(use_valid_start_date),
end_dttm = as.POSIXct(use_valid_end_date),
.options = furrr_options(
globals = c(
"calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
"generate_pnbd_validation_transactions"
),
packages = c("tidyverse", "fs"),
scheduling = FALSE,
seed = 421
),
.progress = TRUE
)
) |>
select(-cust_params) |>
unnest(chunk_data)
pnbd_fixed3_validsims_tbl |> glimpse()Rows: 1,000
Columns: 3
$ customer_id <chr> "C202001_0025", "C202001_0084", "C202001_0091", "C202001_0…
$ sim_file <glue> "precompute/pnbd_init_fixed3/sims_pnbd_init_fixed3_C20200…
$ chunk_data <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
Having constructed the validation posterior we now join this data to our observations and calculate the \(q\)-values as before.
Show code
pnbd_fixed3_1000_validation_tbl <- pnbd_fixed3_validsims_tbl |>
group_by(customer_id) |>
summarise(
.groups = "drop",
sim_count = list(map_int(sim_file, ~ .x |> read_rds() |> nrow()))
) |>
left_join(obs_2022_stats_tbl, by = "customer_id") |>
replace_na(list(tnx_count = 0)) |>
mutate(
tnx_count_qval = map2_dbl(sim_count, tnx_count, ~ ecdf(.x)(.y))
)
pnbd_fixed3_1000_validation_tbl |> glimpse()Rows: 1,000
Columns: 6
$ customer_id <chr> "C202001_0025", "C202001_0084", "C202001_0091", "C20200…
$ sim_count <list> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, …
$ tnx_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0…
$ first_tnx <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2022-03-27 02:…
$ last_tnx <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2022-12-01 09:…
$ tnx_count_qval <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
Once again, for now we want to remove all the customers with no transactions and where all their validation simulations have zero transactions.
Show code
plot_tbl <- pnbd_fixed3_1000_validation_tbl |>
filter(!(tnx_count == 0 & tnx_count_qval == 1))
ggplot(plot_tbl) +
geom_histogram(aes(x = tnx_count_qval), bins = 50) +
labs(
x = "q-Value",
y = "Frequency",
title = "Histogram of q-Values From Posterior Simulation of Transactions Counts"
)Here we see we still have a heavy bias towards high \(q\)-values, so we need to see exactly what the issues are.
7.4 Assessing this Model
We now repeat our assessment approach for this new model, investigating our simulations by looking at transaction counts.
Show code
tnx_data_tbl <- obs_2022_stats_tbl |>
semi_join(pnbd_fixed3_validsims_tbl, by = "customer_id")
obs_customer_count <- tnx_data_tbl |> nrow()
obs_total_tnx_count <- tnx_data_tbl |> pull(tnx_count) |> sum()
pnbd_fixed3_tnx_simsumm_tbl <- pnbd_fixed3_validsims_tbl |>
mutate(data = map(sim_file, read_rds)) |>
unnest(data) |>
group_by(draw_id) |>
summarise(
.groups = "drop",
sim_customer_count = length(sim_tnx_count[sim_tnx_count > 0]),
sim_total_tnx_count = sum(sim_tnx_count)
)
ggplot(pnbd_fixed3_tnx_simsumm_tbl) +
geom_histogram(aes(x = sim_customer_count), binwidth = 1) +
geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
labs(
x = "Simulated Customers With Transactions",
y = "Frequency",
title = "Histogram of Count of Customers Transacted",
subtitle = "Observed Count in Red"
)Show code
ggplot(pnbd_fixed3_tnx_simsumm_tbl) +
geom_histogram(aes(x = sim_total_tnx_count), binwidth = 5) +
geom_vline(aes(xintercept = obs_total_tnx_count), colour = "red") +
labs(
x = "Simulated Transaction Count",
y = "Frequency",
title = "Histogram of Count of Total Transaction Count",
subtitle = "Observed Count in Red"
)Overall, it looks like our choice of prior parameters has a big effect on our estimates, so we need to take this into account.
An improvement to this model would allow us to express uncertainty around these prior parameters, allowing those values themselves to have probability distributions, thereby enabling us to express our uncertainty more formally.
Building such ‘hierarchical models’ is the next step.
Before we move onto this though, we want to increase the data count, as this may help resolve some of the identification issues with the lifetime parts of the model.
8 R Environment
Show code
options(width = 120L)
sessioninfo::session_info()─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.2.2 (2022-10-31)
os Ubuntu 22.04.2 LTS
system x86_64, linux-gnu
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Etc/UTC
date 2023-03-22
pandoc 2.19.2 @ /usr/local/bin/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] RSPM (R 4.2.0)
arrayhelpers 1.1-0 2020-02-04 [1] RSPM (R 4.2.0)
backports 1.4.1 2021-12-13 [1] RSPM (R 4.2.0)
base64enc 0.1-3 2015-07-28 [1] RSPM (R 4.2.0)
bayesplot * 1.10.0 2022-11-16 [1] RSPM (R 4.2.0)
bit 4.0.5 2022-11-15 [1] RSPM (R 4.2.0)
bit64 4.0.5 2020-08-30 [1] RSPM (R 4.2.0)
boot 1.3-28 2021-05-03 [2] CRAN (R 4.2.2)
bridgesampling 1.1-2 2021-04-16 [1] RSPM (R 4.2.0)
brms * 2.18.0 2022-09-19 [1] RSPM (R 4.2.0)
Brobdingnag 1.2-9 2022-10-19 [1] RSPM (R 4.2.0)
cachem 1.0.7 2023-02-24 [1] RSPM (R 4.2.0)
callr 3.7.3 2022-11-02 [1] RSPM (R 4.2.0)
checkmate 2.1.0 2022-04-21 [1] RSPM (R 4.2.0)
cli 3.6.0 2023-01-09 [1] RSPM (R 4.2.0)
cmdstanr * 0.5.3 2023-03-15 [1] Github (stan-dev/cmdstanr@22b391e)
coda 0.19-4 2020-09-30 [1] RSPM (R 4.2.0)
codetools 0.2-18 2020-11-04 [2] CRAN (R 4.2.2)
colorspace 2.1-0 2023-01-23 [1] RSPM (R 4.2.0)
colourpicker 1.2.0 2022-10-28 [1] RSPM (R 4.2.0)
conflicted * 1.2.0 2023-02-01 [1] RSPM (R 4.2.0)
cowplot * 1.1.1 2020-12-30 [1] RSPM (R 4.2.0)
crayon 1.5.2 2022-09-29 [1] RSPM (R 4.2.0)
crosstalk 1.2.0 2021-11-04 [1] RSPM (R 4.2.0)
data.table 1.14.8 2023-02-17 [1] RSPM (R 4.2.0)
digest 0.6.31 2022-12-11 [1] RSPM (R 4.2.0)
directlabels * 2021.1.13 2021-01-16 [1] RSPM (R 4.2.0)
distributional 0.3.1 2022-09-02 [1] RSPM (R 4.2.0)
dplyr * 1.1.0 2023-01-29 [1] RSPM (R 4.2.0)
DT 0.27 2023-01-17 [1] RSPM (R 4.2.0)
dygraphs 1.1.1.6 2018-07-11 [1] RSPM (R 4.2.0)
ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.2.0)
evaluate 0.20 2023-01-17 [1] RSPM (R 4.2.0)
fansi 1.0.4 2023-01-22 [1] RSPM (R 4.2.0)
farver 2.1.1 2022-07-06 [1] RSPM (R 4.2.0)
fastmap 1.1.1 2023-02-24 [1] RSPM (R 4.2.0)
forcats * 1.0.0 2023-01-29 [1] RSPM (R 4.2.0)
fs * 1.6.1 2023-02-06 [1] RSPM (R 4.2.0)
furrr * 0.3.1 2022-08-15 [1] RSPM (R 4.2.0)
future * 1.32.0 2023-03-07 [1] RSPM (R 4.2.0)
gamm4 0.2-6 2020-04-03 [1] RSPM (R 4.2.0)
generics 0.1.3 2022-07-05 [1] RSPM (R 4.2.0)
ggdist 3.2.1 2023-01-18 [1] RSPM (R 4.2.0)
ggplot2 * 3.4.1 2023-02-10 [1] RSPM (R 4.2.0)
globals 0.16.2 2022-11-21 [1] RSPM (R 4.2.0)
glue * 1.6.2 2022-02-24 [1] RSPM (R 4.2.0)
gridExtra 2.3 2017-09-09 [1] RSPM (R 4.2.0)
gtable 0.3.1 2022-09-01 [1] RSPM (R 4.2.0)
gtools 3.9.4 2022-11-27 [1] RSPM (R 4.2.0)
hms 1.1.2 2022-08-19 [1] RSPM (R 4.2.0)
htmltools 0.5.4 2022-12-07 [1] RSPM (R 4.2.0)
htmlwidgets 1.6.1 2023-01-07 [1] RSPM (R 4.2.0)
httpuv 1.6.9 2023-02-14 [1] RSPM (R 4.2.0)
igraph 1.4.1 2023-02-24 [1] RSPM (R 4.2.0)
inline 0.3.19 2021-05-31 [1] RSPM (R 4.2.0)
jsonlite 1.8.4 2022-12-06 [1] RSPM (R 4.2.0)
knitr 1.42 2023-01-25 [1] RSPM (R 4.2.0)
labeling 0.4.2 2020-10-20 [1] RSPM (R 4.2.0)
later 1.3.0 2021-08-18 [1] RSPM (R 4.2.0)
lattice 0.20-45 2021-09-22 [2] CRAN (R 4.2.2)
lifecycle 1.0.3 2022-10-07 [1] RSPM (R 4.2.0)
listenv 0.9.0 2022-12-16 [1] RSPM (R 4.2.0)
lme4 1.1-31 2022-11-01 [1] RSPM (R 4.2.0)
loo 2.5.1 2022-03-24 [1] RSPM (R 4.2.0)
lubridate * 1.9.2 2023-02-10 [1] RSPM (R 4.2.0)
magrittr * 2.0.3 2022-03-30 [1] RSPM (R 4.2.0)
markdown 1.5 2023-01-31 [1] RSPM (R 4.2.0)
MASS 7.3-58.1 2022-08-03 [2] CRAN (R 4.2.2)
Matrix 1.5-1 2022-09-13 [2] CRAN (R 4.2.2)
matrixStats 0.63.0 2022-11-18 [1] RSPM (R 4.2.0)
memoise 2.0.1 2021-11-26 [1] RSPM (R 4.2.0)
mgcv 1.8-41 2022-10-21 [2] CRAN (R 4.2.2)
mime 0.12 2021-09-28 [1] RSPM (R 4.2.0)
miniUI 0.1.1.1 2018-05-18 [1] RSPM (R 4.2.0)
minqa 1.2.5 2022-10-19 [1] RSPM (R 4.2.0)
munsell 0.5.0 2018-06-12 [1] RSPM (R 4.2.0)
mvtnorm 1.1-3 2021-10-08 [1] RSPM (R 4.2.0)
nlme 3.1-160 2022-10-10 [2] CRAN (R 4.2.2)
nloptr 2.0.3 2022-05-26 [1] RSPM (R 4.2.0)
parallelly 1.34.0 2023-01-13 [1] RSPM (R 4.2.0)
pillar 1.8.1 2022-08-19 [1] RSPM (R 4.2.0)
pkgbuild 1.4.0 2022-11-27 [1] RSPM (R 4.2.0)
pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.2.0)
plyr 1.8.8 2022-11-11 [1] RSPM (R 4.2.0)
posterior * 1.4.0 2023-02-22 [1] RSPM (R 4.2.0)
prettyunits 1.1.1 2020-01-24 [1] RSPM (R 4.2.0)
processx 3.8.0 2022-10-26 [1] RSPM (R 4.2.0)
projpred 2.4.0 2023-02-12 [1] RSPM (R 4.2.0)
promises 1.2.0.1 2021-02-11 [1] RSPM (R 4.2.0)
ps 1.7.2 2022-10-26 [1] RSPM (R 4.2.0)
purrr * 1.0.1 2023-01-10 [1] RSPM (R 4.2.0)
quadprog 1.5-8 2019-11-20 [1] RSPM (R 4.2.0)
R6 2.5.1 2021-08-19 [1] RSPM (R 4.2.0)
Rcpp * 1.0.10 2023-01-22 [1] RSPM (R 4.2.0)
RcppParallel 5.1.7 2023-02-27 [1] RSPM (R 4.2.0)
readr * 2.1.4 2023-02-10 [1] RSPM (R 4.2.0)
reshape2 1.4.4 2020-04-09 [1] RSPM (R 4.2.0)
rlang * 1.0.6 2022-09-24 [1] RSPM (R 4.2.0)
rmarkdown 2.20 2023-01-19 [1] RSPM (R 4.2.0)
rstan 2.21.8 2023-01-17 [1] RSPM (R 4.2.2)
rstantools 2.3.0 2023-03-09 [1] RSPM (R 4.2.0)
scales * 1.2.1 2022-08-20 [1] RSPM (R 4.2.0)
sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.2.0)
shiny 1.7.4 2022-12-15 [1] RSPM (R 4.2.0)
shinyjs 2.1.0 2021-12-23 [1] RSPM (R 4.2.0)
shinystan 2.6.0 2022-03-03 [1] RSPM (R 4.2.0)
shinythemes 1.2.0 2021-01-25 [1] RSPM (R 4.2.0)
StanHeaders 2.21.0-7 2020-12-17 [1] RSPM (R 4.2.0)
stringi 1.7.12 2023-01-11 [1] RSPM (R 4.2.0)
stringr * 1.5.0 2022-12-02 [1] RSPM (R 4.2.0)
svUnit 1.0.6 2021-04-19 [1] RSPM (R 4.2.0)
tensorA 0.36.2 2020-11-19 [1] RSPM (R 4.2.0)
threejs 0.3.3 2020-01-21 [1] RSPM (R 4.2.0)
tibble * 3.2.0 2023-03-08 [1] RSPM (R 4.2.0)
tidybayes * 3.0.3 2023-02-04 [1] RSPM (R 4.2.0)
tidyr * 1.3.0 2023-01-24 [1] RSPM (R 4.2.0)
tidyselect 1.2.0 2022-10-10 [1] RSPM (R 4.2.0)
tidyverse * 2.0.0 2023-02-22 [1] RSPM (R 4.2.0)
timechange 0.2.0 2023-01-11 [1] RSPM (R 4.2.0)
tzdb 0.3.0 2022-03-28 [1] RSPM (R 4.2.0)
utf8 1.2.3 2023-01-31 [1] RSPM (R 4.2.0)
vctrs 0.5.2 2023-01-23 [1] RSPM (R 4.2.0)
vroom 1.6.1 2023-01-22 [1] RSPM (R 4.2.0)
withr 2.5.0 2022-03-03 [1] RSPM (R 4.2.0)
xfun 0.37 2023-01-31 [1] RSPM (R 4.2.0)
xtable 1.8-4 2019-04-21 [1] RSPM (R 4.2.0)
xts 0.13.0 2023-02-20 [1] RSPM (R 4.2.0)
yaml 2.3.7 2023-01-23 [1] RSPM (R 4.2.0)
zoo 1.8-11 2022-09-17 [1] RSPM (R 4.2.0)
[1] /usr/local/lib/R/site-library
[2] /usr/local/lib/R/library
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Show code
options(width = 80L)